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Abstract 

Tensor factorization arises in many machine 
learning applications, such as knowledge base 
modeling and parameter estimation in latent 
variable models. However, numerical meth¬ 
ods for tensor factorization have not reached 
the level of maturity of matrix factorization 
methods. In this paper, we propose a new al¬ 
gorithm for CP tensor factorization that uses 
random projections to reduce the problem 
to simultaneous matrix diagonalization. Our 
method is conceptually simple and also ap¬ 
plies to non-orthogonal and asymmetric ten¬ 
sors of arbitrary order. We prove that a small 
number random projections essentially pre¬ 
serves the spectral information in the ten¬ 
sor, allowing us to remove the dependence 
on the eigengap that plagued earlier tensor- 
to-matrix reductions. Experimentally, our 
method outperforms existing tensor factor¬ 
ization methods on both simulated data and 
two real datasets. 

1 Introduction 

Given a tensor T € of the following form: 

k 

T — TTitti 0 0 Ci -|- noise, (1) 

i=l 

our goal is to estimate the factors ai,bi,Ci S and 
factor weights tt G K^'. In machine learning and statis¬ 
tics, this tensor T typically represents higher-order re¬ 
lationships among variables, and we would like to un¬ 
cover the salient factors that explain these relation¬ 
ships. This problem of tensor factorization is an im¬ 
portant problem rich with applications [1]: modeling 
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knowledge bases [2], topic modeling [3], community 
detection [4], learning graphical models [5, 6]. The 
last three fall into a class of procedures based on the 
method of moments for latent-variable models, which 
are notable because they provide guarantees of consis¬ 
tent parameter estimation [7]. 

However, tensors, unlike matrices, are fraught with dif¬ 
ficulties: identifiability is a delicate issue [8, 9, 10], and 
computing Equation 1 is in general NP-hard [11, 12]. 
In this work, we propose a simple procedure to reduce 
the problem of factorizing tensors to that of factor¬ 
izing matrices. Specifically, we first project the ten¬ 
sor T onto a set of random vectors, producing a set 
of matrices. Then we simultaneously diagonalize the 
matrices, producing an estimate of the factors of the 
original tensor. We can optionally refine our estimate 
by running the procedure using the estimated factors 
rather than random vectors. Our approach applies to 
orthogonal, non-orthogonal and asymmetric tensors of 
arbitrary order. 

From a practical perspective, this approach enables 
us to immediately leverage mature algorithms for ma¬ 
trix factorization. Such algorithms often have readily 
available implementations that are numerically stable 
and highly optimized. In our experiments, we observed 
that they contribute to improvements in accuracy and 
speed over methods that deal directly with a tensor. 

From a theoretical perspective, we consider both sta¬ 
tistical and optimization aspects of our method. Most 
of our results pertain to the former: we provide guar¬ 
antees on the accuracy of a solution as a function of the 
noise e (this noise typically comes from the statistical 
estimation of T from finite data) that are comparable 
to those of existing methods (Table 1). Algorithms 
based on matrix diagonalization have been previously 
criticized [7] to be extremely sensitive to noise due to a 
dependence on the smallest difference between eigen¬ 
values (the eigengap). We show that this dependence 
can be entirely avoided using just O(logfc) tensor pro¬ 
jections chosen uniformly at random. Furthermore, 
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our guarantees are independent of the algorithm used 
for diagonalizing the projection matrices. 

The optimization aspects of our method, on the other 
hand, depend on the choice of joint diagonalization 
subroutine. Most subroutines enjoy local quadratic 
convergence rates [13, 14, 15] and so does our method. 
With sufficiently low noise, global convergence guaran¬ 
tees can be established for some joint diagonalization 
algorithms [16]. More importantly, local optima are 
not an issue for our method in practice, which is in 
sharp contrast to some other approaches, such as ex¬ 
pectation maximization (EM). 

Finally, we show that our method obtains accuracy 
improvements over alternating least squares and the 
tensor power method on several synthetic and real 
datasets. On a community detection task, we obtain 
up to a 15% reduction in error compared to a recently 
proposed approach [4], and up to an 8% reduction in 
error on a crowdsourcing task [17], matching or out¬ 
performing a state-of-the-art EM-based estimator on 
three of the four datasets. 

Notation Let [n] = {1,... , n} denote the first n pos¬ 
itive integers. Let be the indicator vector which is 
1 in component i and 0 in all other components. We 
use 0 to denote the tensor product: if u,v,w S 
then It (g) w (g) w £ ^dxdxd i third order tensor 

T £ ^dxdxd .^g giggjjg vector and matrix application 
as, 

d d d 

T{x,y,z) = EEE 

i—1 j—1 k—1 

d d d 

T{X,Y,Z),,k = EEE TlmnXliY^j , 

1—1 m—1 n—1 

for vectors x,y,z £ and matrices X,Y,Z £ 
Rdxk^ The partial vector application (or projection) 
T{I,I,w) of a vector u> £ returns a, d x d matrix: 
r(J, J, w)^j = TijkWk- 

We define the CP decomposition of a tensor T £ 
^dxdxd gg p _ ^:^ai Ci, for a^, 5^, Ci £ R‘^. 

The rank of T is said to be k. When Oi = bi = Ci = Ui 
for all i, and the UiS are orthogonal, we sa^ T has a 
symmetric orthogonal factorization, T = . 

Projecting a tensor T = TTiOi ® bi® Ci along w 

produces a matrix T{I,I,w) = bt. 

We use Xi = TTi{cJw) to refer to the factor weights (or 
eigenvalues in the orthogonal setting) of the projected 
matrix. 

^ We will only consider third order tensors for the re¬ 
mainder of this paper, thongh the approach natnrally ex¬ 
tends to tensors of arbitrary order. 


Method 

p < 

\\Ui - Ui\\2 < 

Gonv. 

TPM [7] 

0 

e 

'Tl'min 

G 

Givens [18] 

0 

? 

G 

ALS [19] 

polylog((i) 

y/d 

e 1 y/kJdX^ 

^min ’^min 

L 

SD2 [20] 

0 


G 

c 

TTminCmini^j ki-u-jl)'" 

This paper 

1 

(1 —/F2)7r^in 

L/G 


Table 1: Comparison of tensor factorization algo¬ 
rithms (Section 2.1). For a tensor with noise e (Equa¬ 
tion 1) and allowed incoherence fa, we show an upper 
bound on the error in the recovered factors jjui — iti ||2 
and whether the convergence is (L)ocal or (G)lobal. 
The factor weights tt are assumed to be normalized 
(||7r||i = 1). ||17“^||2 is the 2-norm of the inverse of 
factors U~^. Our method allows for arbitrary incoher¬ 
ence with a sensitivity to noise comparable to existing 
methods ([20, 7, 19]), and with better empirical per¬ 
formance. In the orthogonal setting, our algorithm is 
globally convergent for sufficiently small e. 

For a vector of values tt £ , we use TTmin and TTmax to 

denote the minimum and maximum absolute values of 
the entries, respectively. Finally, we use Sij to denote 
the indicator function, which equals 1 when i = j and 
0 otherwise. 

2 Background 

In this section, we establish the context for tensor fac¬ 
torization, method of moments for estimating latent- 
variable models, and simultaneous matrix diagonaliza¬ 
tion. 

2.1 Tensor factorization algorithms 

Existing tensor factorization methods vary in their 
sensitivity to noise e in the tensor, their tolerance of 
non-orthogonality (as measured by the incoherence p) 
and in their convergence properties (Table 1). The 
robust tensor power method (TPM, [7]) is a popu¬ 
lar algorithm with theoretical guarantees on global 
convergence. A recently-developed coordinate-descent 
method for orthogonal tensor factorization based on 
Givens rotations [18] is empirically more robust than 
the TPM; however it is limited to the full-rank setting 
and lacks a sensitivity analysis. A further limitation 
of both methods is that they only work for symmetric 
orthogonal tensors. Asymmetric non-orthogonal ten¬ 
sors could be handled by preprocessing and whitening, 
but this can be a major source of errors in itself [21]. 
Alternating least squares (ALS) and other gradient- 
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based methods [22] are simple, popular, and apply to 
the non-orthogonal setting, but are known to easily 
get stuck in local optima [23]. Anandkumar et al. [19] 
explicitly show both local and global convergence guar¬ 
antees for a slight modification of the ALS procedure 
under certain assumptions on the tensor T. 


simultaneous diagonalization, we are given a set of 
symmetric matrices Mi,..., Ml S (see Section 6 
for a reduction from the asymmetric case), where each 
matrix can be expressed as 

Mi=UAiU^ +eRi. (2) 


Finally, some authors have also proposed using simul¬ 
taneous diagonalization for tensor factorization: Lath- 
auwer [23] proposed a reduction, but it requires form¬ 
ing a linear system of size O(d^) and is quite com¬ 
plex. Anandkumar et al. [20] performed multiple ran¬ 
dom projections, but only diagonalized two at a time 
(SD2), leading to unstable results; the method also 
only applies to orthogonal factors. Anandkumar et al. 
[7] briefly remarked that using all the projections at 
once was possible but did not pursue it. In contrast, 
our method, has comparable bounds to the tensor 
power method in the orthogonal setting (convention¬ 
ally j|7rj|i = 1 is assumed), and the ALS method in 
the non-orthogonal setting. Furthermore, in the non- 
orthogonal setting, our method works for arbitrary in¬ 
coherence as long as the factors U are non-singular. 

2.2 Parameter estimation in mixture models 

Tensor factorization can be used for parameter esti¬ 
mation for a wide range of latent-variable models such 
as Gaussian mixture models, topic models, hidden 
Markov models, etc. [7]. For illustrative purposes, we 
focus on the single topic model [7], dehned as follows: 
For each of n documents, draw a latent “topic” h G [fc] 
with probability P[h = j] = and three observed 
words Xi,X 2 ,X 3 G {ci,..., e^}, which are condition¬ 
ally independent given h with P[xj = w \ h = i] = Uiw 
for each j G {1,2,3}. The parameter estimation task is 
to output an estimate of the parameters (tt, 
given n documents {{x^i\ X 2 \ (importantly, 

the topics are unobserved). 

Traditional approaches typically use Expectation 
Maximization (EM) to optimize the marginal log- 
likelihood, but this algorithm often gets stuck in lo¬ 
cal optima. The method of moments approach is to 
cast estimation as tensor factorization: define the em¬ 
pirical tensor T = ^ ® ^ 2 ^ G ^ 3 *^- I* can be 

shown that T = + ei? (a refinement 

of Equation 1), where ei? G is the statistical 

noise which goes to zero as n —)■ oo. A tensor factoriza¬ 
tion scheme that asymptotically recovers estimates of 
(tt, {uilJkj) therefore provides a consistent estimator 
of the parameters. 

2.3 Simultaneous diagonalization 

We now briefly review simultaneous matrix diagonal¬ 
ization, the main technical driver in our approach. In 


The diagonal matrix A/ G and the noise eRi 

are individual to each matrix, but the non-singular 
transform U G is common to all the matrices. 

We also define the full-rank extensions. 


[/=[[/ G-L] 


A, 

0 0 ’ 


(3) 


where the columns of !/■*' G span the orthog¬ 
onal subspace of U and A; G has been appropri¬ 

ately padded with zeros. Note that UAiU^ = UAiU^. 

The goal is to find an invertible transform V~^ G 
such that each MiV~^ is nearly diagonal. We re¬ 
fer to the V~^ as inverse factors. When e = 0, this 
problem admits a unique solution when there are at 
least two matrices [24]. There are a number of objec¬ 
tive functions for finding V [25, 13, 26], but in this 
paper, we focus on a popular one that penalizes off- 
diagonal terms: 

L 

F{X) A ^off(X-iMzA-T), off(A) = ^ (4) 

l=l i^j 

An important setting of this problem, which we refer 
to as the orthogonal case, is when we know the true 
factors U to be orthogonal. In this case we constrain 
our optimization variable X to be orthogonal as well, 
i.e. X-i = A^. 

In principle, we could just diagonalize one of the ma¬ 
trices, say Ml (assuming its eigenvalues are distinct) 
to recover U. However, when e > 0, this procedure 
is unreliable and simultaneous diagonalization greatly 
improves on robustness to noise, as we will witness in 
Section 4. 

There exist several algorithms for optimizing F(X). In 
this paper, we will use the Jacobi method [27, 25] for 
the orthogonal case and the QRJID algorithm [26] for 
the non-orthogonal case. Both techniques are based on 
same idea of iteratively constructing X~^ via a prod¬ 
uct of simple matrices X~^ = Bt ■ ■ ■ B 2 B 1 , where at 
each iteration t = 1,... ,T, we choose Bt to minimize 
F{X). Typically, this can be done in closed form. 

The Jacobi algorithm for the orthogonal case is a sim¬ 
ple adaptation of the Jacobi method for diagonalizing 
a single matrix. Each Bt is chosen to be a Givens ro¬ 
tation [27] defined by two of the d axes i < j G [d\-. 
Bt = {cos 6){Xii + Ajj) -I- (sin0)(Aij — Aji) for some 





Tensor Factorization via Matrix Factorization 


angle 0, where is a matrix that is 1 in the (z, j)-th 
entry and 0 elsewhere. We sweep over all i < j, com¬ 
pute the best angle 9 in closed form using the formula 
proposed by Cardoso and Souloumiac [25] to obtain 
Bt, and then update each Mi by BtMiBj . The above 
can be done in 0{<PL) time per sweep. 

For the non-orthogonal case, the QRJID algorithm is 
similar, except that Bt is chosen to be either a lower or 
upper unit triangular matrix (Bt = I + aAij for some 
a and i ^ j)- The optimal value of a that minimizes 
F{X) can also be computed in closed form (see [26] for 
details). The running time per iteration is the same 
as before. 

3 Tensor factorization via 

simultaneons matrix diagonalization 

We now outline our algorithm for symmetric third or¬ 
der tensors. In Section 6, we describe how to gener¬ 
alize our method to arbitrary tensors. Observe that 
the projection of T = along a vector w 

is a matrix T(I,I,w) = Ui)uf'^ that pre¬ 

serves all the information about the factors Ui (as¬ 
suming the TTi^w^Ui)’s are distinct). In principle one 
can recover the Ui through an eigendecomposition of 
T{I,I,w). However, this method is sensitive to noise: 
the error \\ui — Ui \\2 of an estimated eigenvector Ui 
depends on the reciprocal of the smallest eigengap 
uiaxj^i 1/jAi — Aj] of the projected matrix (recall that 
Xi = TTi{w^Ui)), which can be large and lead to inac¬ 
curate estimates. 

Instead, let us obtain the factorization of T from pro¬ 
jections along multiple vectors wi,W2,--- ,wl- The 
projections produce matrices of the form Mi = 
with Xu = TTiwJUi] they have common 
eigenvectors, and therefore can be simultaneously di¬ 
agonalized. As we will show later, joint diagonaliza¬ 
tion is sensitive to the measure rnini^^ ^j^j^(Ai/ — 
~ ^ which averages the mini¬ 

mum eigengap across the matrices Mi (here, Xu = 
TTiiwJu^)). 

A natural question to ask is along which vectors (wi) 
should we project? In Section 4 and Section 5 we show 
that (a) estimates of the inverse factors (vi) are a good 
choice (when the (vi) are approximately orthogonal, 
they are close to the factors (ui)) and that (b) ran¬ 
dom vectors do almost as well. This suggests a simple 
two-step method: (i) first, we find approximations of 
the tensor factors by simultaneously diagonalizing a 
small number of random projections of the tensor; (ii) 
then we perform another round of simultaneous diag¬ 
onalization on projections along the inverse of these 
approximate factors. Algorithm 1 describes the ap¬ 
proach. Its running time is 0{k'^cfis), where s is the 


Algorithm 1 Two-stage tensor factorization algo- 
rithm ____ 

Require: f = T + eRG where T has a CP 

decomposition T = J2i=i Lq > 2 

Ensure: Estimates of factors, tt, fti, • • • ,Uk- 

1: Dehne •(— {T{I,I,wi)}f‘Xli with {wi}^2i 
chosen uniformly from the unit sphere . 

2: Obtain factors and their inverse 

from the simultaneous diagonalization of 

3: Dehne ^ {T{IJ 

4: return Factors and factor weights 

{^i}i=i from simultaneously diagonalizing 


number of sweeps of the simultaneous diagonalization 
algorithm. 

4 Perturbation analysis for orthogonal 
tensor factorization 

In this section, we will focus on the orthogonal setting, 
returning to non-orthogonal factors in Section 5. For 
ease of exposition, we restrict ourselves to symmet¬ 
ric third-order orthogonal tensors: T = 

Here the inverse factors (vi) are equivalent to the fac¬ 
tors (ui), and we do not distinguish between the two. 
The proofs for this section can be found in Appendix 
B. 

Our sensitivity analysis builds on the perturbation 
analysis result for the simultaneous diagonalization of 
matrices in Cardoso [28]. 

Lemma 1 (Cardoso [28]). Let Mi = UAiU^ + eRi, 
I G \L\, he matrices with common factors U G 
and diagonal A/ G Let U G be a full- 

rank extension of U with columns ui,M 2 ,...,Md and 
let U G be the orthogonal minimizer of the joint 

diagonalization objective F{-). Then, for all Uj, j € 
[fc], there exists a column Uj of U such that 


\\uj-Uj\\2 < e. 


\ i=i 


where E € 


IS 




EjUj 


( 6 ) 


when i ^ j and i < k or j < k. We define Eij = 0 
when i = j and Xu = 0 when i > k. 


In the tensor factorj^ation setting, we jointly diag¬ 
onalize projections Mi, I = I,2,...,L of the noisy 
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tensor T along vectors wf. Mi = T{I,I,wi) = 
Ui)uf'^ + where Ri = 

R{I,I,wi) has unit operator norm. Cardoso’s lemma 
provides bounds on the accuracy of recovering the Ui 
via joint diagonalization; in particular, we can further 
rewrite Equation 6 in the tensor setting as: 


Eti ’ 


( 7 ) 


where pij = {niUi — T^jUj) and = R{ui,Uj, I). 

Equation 7 tells us that we can control the mag¬ 
nitude of the Eij (and hence the error on recov¬ 
ering the Ui) through appropriate choice of the 
projections (wi). Ideally, we would like to en¬ 
sure that the projected eigengap, mirii^j wjpij = 
mini^j Ui) — TTj{wJUj)), is bounded away from 

zero for at least one Mi so that the denominator of 
Equation 7 does not blow up. 


Random projections The first step of Algorithm 
1 projects the tensor along random directions. The 
form of Equation 7 suggests that the error terms, Eij, 
should concentrate over several projections and we will 
show that this is indeed the case. Consequently, the er¬ 
ror terms will depend inversely on the mean of wjPij , 
llPiilli = +’’’!> ^min- ^ur final result is as follows: 

Theorem 1 (Tensor factorization with random pro¬ 
jections). Let wi,... ,wl he i.i.d. Gaussian vectors, 
wi ^ A/'(0,/), and let the matrices Mi G be 

constructed via projection of T along wi,..., wl- Let 
Ui be estimates of the Ui derived from the Mi. Let 
L > 161og(2d(fc— l)/d)^. Then, with probability at 
least 1 — (5, for every Ui, there exists a Ui such that 


||Uz - Ui\\2 < 


( 2Y/2||7r||i7rmax 

I 



e + o{e), 


where C{S) = O 


(\og{kd)/5)^y 


The first of the above two terms is the fundamental 
error in estimating a noisy tensor T; the second term 
is due to the concentration of random projections and 
can be made arbitrarily small by increasing L. 


Plug-in projections The next step of our algo¬ 
rithm projects the tensor along the approximate fac¬ 
tors from step 2. Intuitively, if the wi are close to the 
eigenvectors Ui, then wJPij = wJ{'KiUi — TijUj) ~ iTiSu. 
Then for each i j, there is some projection that en¬ 
sures that Eij is bounded and does not depend on the 
projected eigengap mini^j Ui) — tt{wJU j)7 

Theorem 2 (Tensor factorization with plug-in projec¬ 
tions). Let Wl,... ,Wk be approximations ofui,...,Uk: 


||w; — Ui ||2 = 0(e), and let M G be constructed 

via projection of T along wi,... ,Wk. Let Ui he esti¬ 
mates of the Ui derived from the Mi. Then, for every 
Ui, there exists a Ui such that 


I 2iy||7r||i7rmax , , 

h < — -2-e + o(e)- 


Note that Theorem 1 says that with 0{d) random 
projections, we can recover the eigenvectors Ui with 
almost the same precision as if we used approximate 
eigenvectors, with high probability. Moreover, as L ^ 
oo, there is no gap between the precision of the two 
methods. Theorem 2 on the other hand suggests that 
we can tolerate errors on the order of 0(e) without sig¬ 
nificantly affecting the error in recovering Ui. In prac¬ 
tice, we find that using the plug-in estimates allows us 
to improve accuracy with fewer random projections. 


5 Perturbation analysis for 

non-orthogonal tensor factorization 


We now extend our results to the case when the tensor 
T has a non-orthogonal symmetric CP decomposition: 
T = Ei=i where the Ui are not orthogonal and 

k < d. We parameterize the non-orthogonality using 
incoherence: p, = maxi^j ul Uj and the norm of the in¬ 
verse factor IIII 2 where V = U~^. Compared to the 
orthogonal setting, our bounds reveal an O ( 1 

dependence on incoherence. Note that unlike previous 
work, our algorithm does not require an explicit bound 
on p (i.e. any p < 1 is sufficient), as long as the factors 
U are non-singular. Proofs for this section are found 
in Appendix C. 


We base our analysis on the perturbation result by 
Afsari [24]. 


Lemma 2 (Afsari [24]). Let Mi = UAiU^ eRi, I G 
[L], be matrices with common factors U G and 

diagonal A; G Let U G be a full-rank 

extension ofU with columns Ui,U 2 , ■ ■. ,Ud and let V = 
U~^, with rows vi,V 2 , ■ ■ ■ ,Vd. Let U G be the 

minimizer of the joint diagonalization objective F{-) 
and let V = tJ~^. 


Then, for all Uj, j G [k], there exists a column Uj of 
U such that 


||w,- - w,-||2 < e 


\ 


E 




3(e), 


( 8 ) 
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where the entries of E G are bounded by 


\E^^ < 


1-P?, VIIA 


RiVjX^ 


■3 1 


llA.Ili 

L 




1^1 


when i ^ j and Eij = 0 when i = j and Xu = 0 when 
i > k. Here Xi = {Xu, Xi 2 , ■■■, Xu,) G and pij = 
11 ^ 11 ^ is the modulus of uniqueness, a measure of 
how ill-conditioned the problem is. 


In the orthogonal case, we had a dependence on the 
eigengap Xi — Xj. Now the error crucially depends on 
the modulus of uniqueness, pij. The non-orthogonal 
simultaneous diagonalization problem has a unique so¬ 
lution iff I Pij I < 1 for all i ^ j [24]. In the orthogo¬ 
nal case, Pij =0. It can be shown that pij can once 
again be controlled by appropriately choosing the pro¬ 
jections {wi). 

To get a handle on the difficulty of the problem, let us 
assume that the vectors Ui are incoherent: ujuj < p. 
for all i ^ j. Intuitively, the problem is easy when 
/i « 0 and hard when p, « 1. 


Random projections Intuitively, random projec¬ 
tions are isotropic and hence we expect the projections 
Xi and Xj to be nearly orthogonal to each other. This 
allows us to show that pij < 0{p), which matches our 
intuitions on the difficulty of the problem. Our final 
result is the following: 

Theorem 3 (Non-orthogonal tensor factorization 
with random projections). Let wi,... ,wl bei.i.d. ran¬ 
dom Gaussian vectors, wi ^ J\f{0,I), and let the 

matrices Mi G be constructed via projection 

of T along wi,...,wl. Assume incoherence p on 

{ui): vjUj < p. Let Lq = and let L > 

Lolog(15d(fc — 1)/^)^. Then, with probability at least 
1 — S, for every Ui, there exists a Ui such that 


< O 


^/|| 7r II iTTniax ||F II 2 
^min 1 - iR 


(1 -t C{S)) e o(e). 


where C{5) = 


{\og{kd/5)^y 


Once again, the error decomposes into a fundamental 
recovery error and a concentration term. Note that 
the error is sensitive to the smallest factor weight, 
Ti'min- This dependence arises from the sensitivity of 
the non-orthogonal factorization method to the Xi with 
the smallest norm and is unavoidable. 


Plug-in projections When using plug-in estimates 
for the projections, two obvious choices arise: esti¬ 
mates of the columns of the factors, {ui), or the rows 


of the inverse, {vi). Using estimates of {ui) leads to 
Pij < 0{p), similar to what we saw with random pro¬ 
jections. However, using estimates of {vi) ensures that 
the Xi are nearly orthogonal, resulting in pij « 0! This 
leads to estimates that are less sensitive to the inco¬ 
herence p. 

Theorem 4 (Non-orthogonal tensor factorization 
with plug-in projections). Let wi,...,Wk be approx¬ 
imations of vi,... ,Vk: \\wi — vi \\2 < 0{e), and let the 
matrices Mi G be constructed via projection of 

T along wi,... ,Wk. Also assume that the Ui are in¬ 
coherent: uJ Uj < p when j ^ i. Then, for every Uj, 
there exists a Uj such that 


\Uj - Uj\\2 < O 


■V^ll^ll l^max 


11^ 


T||3 


,(e). 


6 Asymmetric and higher-order 
tensors 

In this section, we present simple extensions to the 
algorithm to asymmetric and higher order tensors. 


Asymmetric tensors We use a reduction to han¬ 
dle asymmetric tensors. Observe that the l-th pro¬ 
jection Ml of an asymmetric tensor has the form 
— 'l2i = UAiV^, for some diagonal (not 

necessarily positive) matrix A/ and common U, V, not 
necessarily orthogonal. For each Mi, define another 

matrix Ni = \ ) and observe that 

‘ \Mi 0 J 


r 0 M^i 

1 

V V ' 


A/ 0 ■ 


V V ' 

1 

' o 

§ 

“ 2 

U -U 


1 - 

o 

1 

> 


u -u_ 


The {Ni) are symmetric matrices with common (in 
general, non-orthogonal) factors. Therefore, they can 
be jointly diagonalized and from their components, we 
can recover the components of the (Mi). This reduc¬ 
tion does not change the modulus of uniqueness of the 
problem: the factor weights remain unchanged. 


Higher order tensors Finally, if we have a higher 
order (say fourth order) tensor T = J2i T^iai®bi®Ci®di 
then we can first determine the ai,bi by projecting 
into matrices T(/, I, w, u) = J^i '^{w^Ci){u^di)ai ® bi, 
and then determine the Ci,di by projecting along the 
first two components. Our bounds only depend on the 
dimension of the matrices being simultaneously diag¬ 
onalized, and thus this reduction does not introduce 
additional error. Intuitively, we should expect that 
additional modes of a tensor should provide more in¬ 
formation and thus help estimation, not hurt it. How¬ 
ever, note that as the tensor order increases, the noise 
in the tensor will presumably increase as well. 
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7 Convergence properties. 

The convergence of our algorithm depends on the 
choice of joint diagonalization subroutine. Theoreti¬ 
cally, the Jacobi method, the QRJID algorithm, and 
other algorithms are guaranteed to converge to a local 
minimum at a quadratic rate [27, 14, 29]. The ques¬ 
tion of global convergence is currently open [30, 25]. 
Empirically though, these algorithms have been found 
in the literature to converge reliably to global minima 
[27, 25, 30] and to corroborate this claim, we conducted 
a series of experiments [16]. 

We first examined convergence to global minima in 
the orthogonal setting. In 1000 trials of the Jacobi 
algorithm on random sets of matrices for various e 
and d = L = 15 , we found that the objective val¬ 
ues formed a Gaussian distribution around e (the best 
accuracy that can be achieved). Then, on each of 
our real crowdsourcing datasets, we ran our algorithm 
from 1000 random starting points; in every case, the 
algorithm converged to the same solution (unlike EM). 
This suggests that our diagonalization algorithm is not 
sensitive to local optima. To complement this empir¬ 
ical evidence, we also established that the Jacobi al¬ 
gorithm will converge to the global minimum when e 
is sufficiently small and when the algorithm is initial¬ 
ized with the eigendecomposition of a single projection 
matrix [16]. 

We also performed similar experiments in the non- 
orthogonal setting using the QRJID algorithm. Unlike 
Jacobi, QRJID suffers from local optima, which is ex¬ 
pected since the general CP decomposition problem is 
NP-hard. However, local optima appear to only affect 
matrices with bad incoherence values, and in several 
real world experiments (see below), non-orthogonal 
methods fared better their orthogonal counterparts. 

8 Experiments 

In the orthogonal setting, we compare our algorithms 
(OJDO, which uses random projections, and OJDl 
which uses with plug-in) with the tensor power method 
(TPM), alternating least squares (ALS), and with the 
method of de Lathauwer [23] . In the non-orthogonal 
setting, we compare de Lathauwer, alternating least 
squares (ALS), non-linear least squares (NLS), and 
our non-orthogonal methods (NOJDO and NOJDl). 

Random versus plug-in projections We gener¬ 
ated random tensors T — ^ Gaus¬ 

sian entries in tt, i? and Ui distributed uniformly in 
the sphere . In Figure 1, we plot the error 
Si=i ~ ^*11 2 (averaged over 1000 trials) of using 
L random projections (blue line), versus using L ran- 



Number of projections 

Figure 1: Comparing random vs. plug-in projections 
{d = k = 10, Cortho — 0.05, Cnonortho — 0.01) 

dom projections followed by plug-in (green line). The 
accuracy of random projections tends to a limit that 
is immediately achieved by the plug-in projections, as 
predicted by our theory. In the orthogonal setting, 
plug-in reduces the total number of projected matrices 
L required to achieve the limiting error by three-fold 
(20 vs. 60 when d = 10). In the non-orthogonal set¬ 
ting, the difference between the two regimes is much 
smaller. 

Synthetic accuracy experiments We generated 
random tensors for various d,k,e using the same pro¬ 
cedure as above. We vary e and report the average 
error ~ '^*11 2 across 50 trials. 

Our method realizes its full potential in the full-rank 
non-orthogonal setting, where OJDO and OJDl are 
up to three times more accurate than alternative meth¬ 
ods (Figure 2, top). In the (arguably easier) under¬ 
complete case, our methods do not achieve more than a 
10% improvement, and overall, all algorithms fare sim¬ 
ilarly (Figure 4 in the supplementary material). Alter¬ 
nating least squares displayed very poor performance, 
and we omit it from our graphs. 

In the full rank setting, there is little difference in per¬ 
formance between our method and Lathauwer (Fig¬ 
ure 2, bottom). In both the full and low-rank cases 
(Figure 2, bottom and Figure 5 in the supplementary 
material), we consistently outperform the standard ap¬ 
proaches, ALS and NLS, by 20-50%. Although we do 
not always outperform Lathauwer (a state-of-the-art 
method), NOJDO and NOJDl are faster and much 
simpler to implement. 

We also tested our method on the single topic model 
from Section 2.2. For d = 50 and k = 10, over 50 trials 
in which model parameters were generated uniformly 
at random in OJDO and OJDl obtained error 

rates of 0.05 and 0.055 respectively, followed by TPM 
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Figure 2: Performance on full-rank synthetic tensors. 

(0.62 error), and Lathauwer (0.65 error). Additional 
experiments on asymmetric tensors and on running 
time are in the supplementary material. 

Community detection in a social network 

Next, we use our method to detect communities in a 
real Facebook friend network at an American univer¬ 
sity [31] using a recently developed estimator based 
on the method of moments [4] . We reproduce a previ¬ 
ously proposed methodology for assessing the perfor¬ 
mance of this estimator on our Facebook dataset [31]: 
ground truth communities are defined by the known 
dorm, major, and high school of each student; em¬ 
pirical and true community membership vectors Cj,Ci 
are matched using a similarity threshold t > 0; for 
a given threshold, we define the recovery ratio as the 
number of true Ci to which an empirical Ci is matched 
and we define the accuracy to be the average £i norm 
distance between Ci and all the Ci that match to it. 
See [31] for more details. By varying t > 0, we ob¬ 
tain a tradeoff curve between the recovery ratio and 
accuracy (Figure 3). Our OJDl method determines 
the top 10 communities more accurately than TPM; 
finding smaller communities was equally challenging 
for both methods. 

Label prediction from crowdsourcing data 

Lastly, we predict data labels within several datasets 
based on real-world crowdsourcing annotations using 
a recently proposed estimator based on the method 
of moments [17]. We incorporate our tensor factor¬ 
ization algorithms within the estimator and evaluate 
the approach on the same datasets as [17] except one, 
which we could not obtain. In addition to the previ¬ 
ously defined methods, we also compare to the expec¬ 
tation maximization algorithm initialized with major- 



Figure 3: Accuracy/recovery tradeoff for community 
detection. 

Table 2: Crowdsourcing experiment results 


Dataset 

Web 

RTE 

Birds 

Dogs 

TPM 

82.25 

88.75 

87.96 

84.01 

OJD 

82.33 

90.00 

89.81 

84.01 

NOJD 

83.49 

90.50 

89.81 

84.26 

ALS 

83.15 

88.75 

88.89 

84.26 

LATH 

83.00 

88.75 

88.89 

84.26 

MV-fEM 

83.68 

92.75 

88.89 

83.89 

Size 

2665 

800 

106 

807 


ity voting by the workers (MV+EM). We measure 
the label prediction accuracy. Overall, NOJDl out¬ 
performs all other tensor-based methods on three out 
of four datasets and results in accuracy gains of up 
to 1.75% (Table 2). OJDl outperforms the TPM 
on every dataset but one, and in two cases even out¬ 
performs ALS and Lathauwer, even though they are 
not affected by whitening. Most interestingly, on two 
datasets, at least one of our methods matches or out¬ 
performs the EM-based estimator. 

9 Discussion 

We have presented a simple method for tensor fac¬ 
torization based on three ideas: simultaneous matrix 
diagonalization, random projections, and plugin esti¬ 
mates. Joint diagonalization methods for tensor fac¬ 
torization have been proposed in the past, but they 
have either been computationally too expensive [23] 
or numerically unstable [20]. We overcome both these 
limitations using multiple random projections of the 
tensor. Note that our use of random projections is 
atypical: instead of using projections for dimensional¬ 
ity reduction (e.g. [32]), we use it to reduce the order 
of the tensor. Finally, we improve estimates of the fac¬ 
tors retrieved with random projections by using them 
as plugin estimates, a common technique in statistics 
to improve statistical efficiency [33]. Extensive exper¬ 
iments show that our factorization algorithm is more 
accurate than the state-of-the-art. 
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Figure 4: Algorithm performance in the orthogonal setting. 
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Figure 5: Algorithm performance in the non-orthogonal setting. 


A Experiments 

A.l Synthetic experiments 

Orthogonal tensors We start by generating random tensors T = ^ + eR with Gaussian entries in 

TT, R and Ui distributed uniformly in the unit sphere We let d = 25, 50,100 and in each case consider two 

regimes: undercomplete tensors with k = 0.2d and full rank tensors, k = d. We vary e and report the average error 
||wi ~ Ui \\2 across all eigenvectors Ui and across 50 trials. In the orthogonal setting, we compare our algorithms 
(OJDO uses random projections, OJDl is with plugin) with the tensor power method (TPM), alternating least 
squares (ALS), and with the method of de Lauthauwer [23]. Alternating least squares displayed very poor 
performance, and we omit it from our graphs. In the undercomplete case (Figure 4, right), all algorithms fare 
similarly and errors are within 10% of each other. Our method realizes its full potential in the full-rank setting, 
where OJDO and OJDl are up to three times more accurate than alternative methods ((Figure 4, left). 

Non-orthogonal tensors In the non-orthogonal setting, we compare de Lathauwer, alternating least squares 
(ALS), non-linear least squares (NLS), and our non-orthogonal methods (NOJDO and NOJDl). We follow 
the same experimental setup as above and summarize our experiments in Figure 5. In the undercomplete setting, 
Lathauwer’s algorithm has the highest accuracy, about a 10% more than our approach (Figure 5, right). In the 
full rank setting, there is little difference in performance between our method and Lathauwer’s. In both settings, 
we consistently outperform the standard approaches, ALS and NLS, by 20-50% (Figure 5, left). Although we do 
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Figure 6: Algorithm performance on asymmetric tensors. 
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Figure 7: Number of flops performed by various algorithms. 


not always outperform Lauthauwer’s state-of-the-art method, NOJDO and NOJDl are faster and much simpler 
to implement. 


Asymmetric tensors Lastly, we evaluate the extension of our algorithm to tensors of size 50 x 50 x 50 having 
three distinct sets of asymmetric components (one in each mode). We find that performance is consistent with 
the symmetric setting, in both orthogonal and non-orthogonal regimes; our method outperforms is competitors 
by at least 25%, and in the non-orthogonal setting, it achieves an error reduction of up to 70% over Lathauer 
(Figure 6). 

A.2 Algorithm running time 

Figure 7 compares the running time in flops of the main algorithms. 

We obtain the plots in Figure 7 by calculating flops as follows. The Jacobi method performs at each sweep 
2dL(dk — ( 2 )) flops (where L is the number of matrices); the QRJl non-orthogonal diagonalization algorithm 
performs flops per sweep. The tensor power method performs a total of Lkd? flops (where L is the number 
of restarts), times the number of steps it takes to reach convergence for a given eigenvector. The flop count of 
Lathauwer’s method is much higher than that of other method’s: at one stage, it requires finding the SVD of a 
X matrix. Consequently, we do not include it in our summary. 
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B Proofs for orthogonal tensor factorization 


In this section we prove perturbation bounds for our algorithm in the setting of orthogonal tensors. 

Recall that we observe T = T + eR where T = where tt^ are factor weights, Ui G are orthogonal 

unit vectors and R is, without loss of generality, symmetric with ||i?||op = 1- Our objective is to estimate tt 
and {ui). Algorithm 1 does so by simultaneously diagonalizing a number of projections of T; we make use of 
projections along random vectors and along approximate factors. In this section we will show why both schemes 
recover and (ui) with high probability. 


Setup Let M. = {Mi,..., M^} be the projections of T along vectors wi ,..., wl, and M = {Mi ,..., Ml} be 
the projections of T along wi,... ,wl- We have that Mi = Ui)ui®Ui and that Mi = Mi + eRi, where 

Ri = R{I, I,wi). Thus, Ml are a set of simultaneously diagonalizable matrices with factors U and factor weights 
Aiz = Ui). From the discussion in Section 2, let C7 be a full-rank extension of U, with columns iti, U 2 , ■ ■ ■ Ud- 

Let TT and ft be a factorization of T returned by Algorithm 1. From Lemma 1, we have that 


\uj-Uj \\2 < e, 


\ i=i 


for j G [/c] where E G has entries 


~ S RlUi 




for i = j 
for i ^ j. 


(9) 


( 10 ) 


For notational convenience, let pij = {iriUi — TTjUj) so that Xu — Xji = wjpij. Let Vij = R{ui,Uj, I) so that 

uj RiUi = R{uj,Ui,wi) = R{ui,Uj,I)^wi = rj^wi. 


The expression for Eij when j i simplifies to, 


Eti ^7 

E7=i u’7P^JP7J'^l 


( 11 ) 


In the rest of this section, we will bound Ey for different choices of {u>z}/Gi. 


B.l Plugin projections 


In Section 4 we proposed using approximate factors Ui as directions to project the tensor T along. In this section, 
we show that doing so guarantees small errors in Ui. 

We begin by bounding the terms Eij. 

Lemma 3 {Eij with plug-in projections). Let wi,...,Wk be unit-vectors approximations of the unit vectors 
ui,... ,Uk: ||w/ — mz ||2 < 7 (so L = k), and let A4 = {Mi,..., Ml} be constructed via projection of T along 
Wi,..., wl- If the set of matrices Ai is simultaneously diagonalized, then to a first-order approximation, 


Eij — 






\\p^. 


-0(7). 


W(Ptj) = (ui + {wi - Ul))^{Tr^Ui - TTjUj) 

= TTiSil - TTjSjl -b {wi - Ul)^{TTiUi - TTjUj) 

< TTiSil - TTjSjl + Iki - Ul\\2\\TTiUi - TTjUj\\2 
= TTiSil - TTjSjl + 0 ( 7 ), 


Proof. We have that 
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where 5 ij = 1 if z = j and 0 otherwise. 

Thus, 

Ya^iwJ PijPjjWl 

_ Ef^i + 0(7)) rJjWi 

+ 0 ( 7 ))^ 

_ + 0(7) 

Trf + TT^ + 0(7) 

_ TT^r^-Mi + 'Ki{Wi - Ui)^ Tij - TTjrJjUj - TTjiWj - Uj^Tij + 0 {j) 

irf + 7r| + 0(7) 


Note that {wi — Ui)^Vij = 0(7) and {wj — Uj)^rij = 0(7), and hence both can be included in the 0(7) term. 

_ rjj {TT.Ui - TTjUj) + 0 ( 7 ) 

TTf+TT^+Oi-f) ■ 

Finally, recall that pij = {niUi — T^jUj) and that ||py |P = tt^ + 7r|. Combining this with the observation that 
yE: = 1 + a; + o(a;), we obtain 


Eij — 


Plj^ij 


\\P^. 


-0(7). 


□ 


Next, we use these term-wise bounds to bound the error in Ui. 

Theorem 5 (Tensor factorization with plugin projections). Let wi,... ,Wk be approximations of ui,... ,Uk such 
that ||wi — Ui||2 < 7 = 0 (e), and let M = {Mi,...,Ml} he eonstrueted via projection of T along wi,... ,wl- 
Then, for j S [k], 


||W, - W.,||2 < 


2v^ 


’^111'^max 


TTf 


e + o(e). 


Proof. From Equation 9 , we have that. 


\uj -Ujh < e 


\ 


E 


for all j S [fc]. By Lemma 3 , we get. 


Eij — 


Pjj^j 

IlKaP 


+ 0 {e), 


and thus. 


|2 ^ 


< e 


\ 


Plrij 




E 


o(e). 


2 
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Now, we must bound expect this the projection to mostly preserve the norm oipij because 

Tij are effectively random vectors. Using Lemma 10 with /r = 0, we get that < 4||7r|li7r,„ax- 

Finally, = nf + 7r| > tt^. 


\\Uj-Uj\\2 < 


< 






tt:" 


2^ 




TT'' 


e + o(e) 
e + o(e). 


□ 


B.2 Random projections 

Let us now consider the case when {wi}fL^ are random Gaussian vectors and present similar bounds. 

Given Equation 11, we should expect Eij to sharply, and now show that this is indeed the case. 

Lemma 4 (Concentration of error Eij). Let wi,..., wl be i.i.d. random Gaussian vectors wi ^ Af(0,1), and let 
Ai — {Ml ,..., Mi} be constructed via projection ofT along wi,... ,wl- If the set of matrices M. is simultane¬ 
ously diagonalized, then the first-order error Eij is sharply concentrated. If L > 161og(2(5), then with probability 
at least 1 — <5, 


„ ^ pJjnj 101og(2/J) ||r,j -||2 

VI WpVV 


Proof. The numerator and denominator of Equation 11 are both distributed as the sum of variables; we show 
below that they respectively concentrate about pJjVij and 

From Lemma 13, we have that the following hold independently with probability at least 1 — 6/2, 


1 « , 

P^oVjWi < pJjVij + \\p,j\\Wnj II 

^ Z=1 




PijpJjWl > \\p,jV 



21og(2/.5) \ 

VI ) 


Applying a union bound on both these events, we get that with probability at least 1 — <5, 



Note that with the given condition on L, ^ Using the property that when x < \, yby < 1 + 2a;, we 
have that 


1 ^ 41og(2/J) 

^ _ 21og(2/6) - ^ 
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Consequently, 


Eij < 


1 


(pijnj + Ibi,1121^1,112 




< 


< 


PijEj / 41og(2/J) \ ||r^j -||2 / log(2/(5) 

Ib^.lli V VZ y IlK.lbV L 

pJ^Tij 101og(2/5) ||ry ||2 

Ib^Zli Vl Ib^Zb' 


^ _l_ 41og(2/J) \ 

VZ J 


□ 


With this term-wise bound, we can again proceed to bounding the error Ui. 

Theorem 6 (Tensor factorization with random projections). Let Wi, ... ,wi^ be i.i.d. random Gaussian vectors, 
wi ~ JV{0, 1), and let Ai = {Mi ,..., M^} be constructed via projection of T along wi,..., wl- Furthermore, let 
L > 161og(2c?(fc — l)/(5)^, then, with probability at least 1 — <5, 


\Uj -Ujh < 




’^111’^max 




|^20y21og(2d(/c- l)/(5) 


VWA 

TTi J 


e-to(e). 


for all j G [/c]. 

Proof. From Equation 9, we have that, 


d 


Ud 


- Uih < e 


\ 


E rl + o{,). 


By Lemma 4, with probability at least 1 — 6/{d{k — 1)), 

p ^ 101og(2d(fc-l)/<5) llrijlla 

*^-|b.Zli Vl WpVW 


Applying a union bound over we have that with probability at least 1 — <5, 


bi - Uih < e 


\ 



101og(2d(A: — F)/5) 


E ^ 




lb' 




■o(e)2 


for all j G [fc]. We have used the fact that for a, 6 > 0, (a -I- 6)^ = + 2ab -|- 6^ < + {a'^ + b'^) + b"^ = 2a^ + 26^ 

and \/a + h < \/a + Vb. 

Note that Ibulb = ijV + 7r| > \tTi\. In Lemma 10, we show that - ^IbHiTTmax- Furthermore, 

Ibull Z 1 by the operator norm bound on R. Thus, we get. 


\Uj -Ujh < 


2i/2|j7r||i7rmax\ 

, ) 


{20V2\og[2d{k-l)/5) 


Vd/A 

TTi j 


e + o{e). 


□ 


C Proofs for non-orthogonal tensor factorization 


In this section we extend our previous analysis to non-orthogonal tensor decomposition. 
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Setup As before, let A4 = {Mi,..., Ml} be the projections of T along vectors wi,... ,wl, and M = 
{Ml,... ,Ml} be the projections of T along wi,... ,wl- We have that Mi = Ui)ui (8) Ui and that 

Ml = Ml + eRi, where Ri = R{I, I,wi). Thus, Mi are a set of simultaneously diagonalizable matrices with 
factors U and factor weights Xu = Tri{wjui). Let U be the full-rank extension of U with unit-norm columns 
ui,U 2 , ■ ■ ■ ,Ud- In this setting, however, the factor U is not orthogonal. Let V = U~"^, with rows vi,V 2 ,. ■. ,Vd- 
Note that we place our incoherence assumption on the columns of U and present results in terms of the 2-norm 
of . When U is incoherent, it can be shown that || || 2 < l + 0 (/i) . Finally, note that in the orthogonal case, 

when fi = 0, the rows (vi) and columns (ui) are identical, and no distinction between the two need be made. 

Let TT and u be a factorization of T returned by Algorithm 1. From Lemma 2, we have that 


\\Uj - Uj\\2 



where the entries of A G are bounded by Lemma 16: 


\Eij\ < 


1 - Pi. 




L 

1=1 


R[Vj Xil 


( 12 ) 


where Xi G is the vector of Ath factor values of Mi, i.e. Xu is the i-th factor value of matrix Mi (i.e. 
Xu = {Ai)ii) and ptj = modulus of uniqueness, is a measure of the singularity of the problem. 

When Xu is generated by projections, Xu = niwjui. Let = R{vi,Vj,I) so that 

vj RiVj = R{vi,Vj,wi) = R{vi,Vj,I)^ wi = rj^wi. 

Note that \\rij \\2 < ||ui|| 2 ||ujII 2 < \\V^\\l 
Equation 12 then simplihes to. 


1-p?. VllAdli ' llA.Ili 


where ||Ai ||2 = T^iJ2iLi wjUiuJwi, and pij has the following expression, 




Fi 


'^wJu.rJjWi 


1^1 


(13) 


Pij — 


xjx, 


= l wi UiU Wl 


I|a.|| 2 ||a ,||2 


AII 2 ^iT,Li wju.uj w;)(Ef=i wjUjuJwi) 


(14) 


Observe that the terms Ui interact with the factor weights Xu, while the terms Vi interact only with the noise 
terms Ri. 

In the rest of this section, we will bound Eij and pij with different choices of {wi}f"^i. 


C.l Plugin projections 


We now assume we have plugin estimates {wi) that are close to the inverse factors (vi): ||rci — Ui ||2 < 0 ( 7 ) for 
I € [A:]. Then, 


wjui = {vi + {wi - vi)yut 


= vJUi -I- llui; - Vi\\2 


(Wi - VlYuj 
-Vl\\2 


= vjui + 0{^). 


Recall that V = U so vJui = Su. 
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It will be useful to keep track of |lAi|||, 


l|Ai|l 2 = Uif 

1^1 

k 

= + 0{j)f 

1^1 

= T^i+0{j). (15) 

Lemma 5 (Modulus of uniqueness for plugin projections). Let Wi,...,Wk he approximations ofvi,...,Vk: 
\\wi — vi \\2 < 0(7) for I S [k], and let A4 = {Mi,... ,Ml} be constructed via projection of T along wi,... ,wl- 
Then, for i^j, 

Pij <0(7), 


Proof. Let us first bound the numerator of Equation 14. 

/ L 


{>H Aj)^ = wJu^uJW ij 

'^vjuiujvi + 0(7) j 


— ^ 2^2 
TT^ TTj 


= + 0(7) 

= 0 ( 7 ). 


Using Equation 15, we get that 


pI = 


0{l) 


(l + 0(^))(l + 0(q)) 

= 0 ( 7 ). 

where in the last line we used the fact that = 1 + a: + o(a;). 


□ 


Lemma 6 (Bound on Eij for non-orthogonal plugin projections). Let wi,... ,Wk be approximations ofvi,...,Vk: 
||w; — a;i ||2 < 0 ( 7 ) for I G [k], and let A4 = (Mi ,. he constructed via projection ofT along wi,... ,wl- 


K-|< + WV^h pJjnj+0{^), 


where = 


Pv = 1^*1 HUTU + 


Proof. Let us bound each term within our expression for Eij (Equation (13)). 


E = E + Oil) 

1^1 

< rJjVj +0(7)- 


Z=1 


E < rJ^Vi + 0(7), 


1^1 


Similarly, 
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From Equation (15), we have 


From Lemma 5 we have that 


l|A,l|^ = Kf+ 0(7). 


pI < 0(7) 

1 ^ 1 
- 1 - 0(7) 


+ 0(7) 


< 1 + 0(7). 


Finally, 


\Eij\ < 

< 



^ + \Trj\VjyTij) + 0 ( 7 ) 

^ + 0(7). 


□ 


Note that the error terms depend not on Ui but rather Vi. This is because the projections (wi) are chosen to be 
close to the Vi. Now, let us bound the error in Ui. 

Theorem 7 (Non-orthogonal tensor factorization with plug-in projections). Let wi,... ,Wk be approximations 
ofvi,... ,Vk: \\wi — W/II 2 < 0(e) for I G [fc] and let Ai = {Mi,... ,Ml} be constructed via projection of T along 
wi,...,wl- Then, for all j G [k], 

\\Uj-Uj\\2 < 8e 

7T 

min 


Proof. From Lemma 15 we have that 


\uj - Uj \\2 < e. 


\ 1=1 


for j G [fc], where Tij is bounded in Lemma 6 as follows: 

'1 1 ' 


I I ^ 


+ 72 11^ II 2 PiAii + 0(e) 




^ 72—11^ II 2 +0(e). 


Consequently, 


\uj -Uj \\2 < e 


\ 


T.e. 




< 


2 e 


TT^ ■ 
min 


\ 


(lll^^lb pjjrij + 0{e)f + o(e) 

*7^1 


< 


4e 


TT^ ■ 
mm 


A 


Y (11^^112 pjjrij)‘^+ I +o(e), 
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where we have used the fact that (a + < 2(a^ + 6^) and that ^/a + b < ^/a + Vb. 

From Lemma 10 we have, pJjTij < 4||7r||i7rmax||l^^||2i 

llfii - Uj \\2 < ("^dllTTlIiTTmaxlll^^ll®) + o{e) 

^min \ / 

< 8 e ^/S^^\\V^\\l+o{e). 

^min 


□ 


C.2 Random projections 


We now study the case where the random projections, (ic;), are drawn from a standard Gaussian distribution. 
First let us show that the modulus of uniqueness pij sharply concentrates around ul uj. 

Lemma 7 (Modulus of Uniqueness with random projections). Let Wi,- ■ -w^ € be entries drawn i.i.d. from 
the standard Normal distribution. Let L > 161og(3/(5)^ Then, with probability at least 1 — <5, 


Pij < uJ Uj + 


101og(3/d) 

Vl 


Proof. Observe from Equation 14 that the numerator and the denominator of pij are essentially distributed as 
a distribution (Lemma 13). Thus, with probability at least 1 — i5/3 each, the following hold. 


1 

L 


L 

UiuJ Wi < ujuj + ||u,||2|kj||2 

/=1 







21og(3/(5) \ 

Vl ) 


]^'^{wjujf > ||Uj||2 



21og(3/5)\ 


Noting that ||ui ||2 = ||mj ||2 = 1 and applying a union bound on the above three events, we get that with 
probability at least 1 — 


Pij < 




1 - 


21og(3/<5) 

Ul 


Under the conditions on L, ' < 4. Applying the property that when x < ^, jZy < 1 + 2x, 

1 


1 _ 21og(3/<5) — 

Ul 


, 41og(3M) „ 

< IH-< 2. 


Vl 


Finally, 




^ 41og(3/(5) \ 

Vl J 


<UiUj I 1 + 




Vl J 

T , 101og(3/5) 


+ 3^/i^x2 
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Let’s now bound the inverse modulus of uniqueness. 

Lemma 8 (Bounding inverse modulus of uniqueness). Let G K'’* be entries drawn i.i.d. from the 

standard Normal distribution. Assume incoherence /i for that the (ut): uj Uj < /i for i ^ j. Let Lq = ^ 2 ) ^ 

Let L > Lolog(3/d)^. Then, with probability at least 1 — (5, 


< 


i-Py- 


1 + 


^ log(3/d)j . 


Proof. From Lemma 7, we have that with probability at least 1 — d, 

T , 101og(3/d) 

Pij < Ui Uj + 


Vl 


Then, 


pi < iuJujf + 2uJuj 


10 


log(3/d) \ / 101og(3/d) y 

Vl Vl ) ■ 


Given the assumptions on L, we have that L > Lolog(3/d)^ > 501og(3/i5)^ and thus 

’101og(3/(5)\ 1101og(3/(5) 


Pij<{u,uj) +2 

= (W UjY + 


y/L J 


Vl 


T ,2 , 251og(3/(5) 


Vl 


Now, we bound 2 , 
Pij 



< 


< 


1 

1 


25 log(3/5) 

7l 

1 


1 - VJujY 1 


25 log(3/(5) 
{1-{uJuY)V'/L 


< 


< 


1 1 

1 — {uJ UjY 1 — 251og(3/.5) 

1 1 


1 - i_ 1 



Again, given assumptions on L, ^ 



Using the identity that if cc < jV, ^ 1 + 2a;, 


1 1 
1 - Pfj “ 1 - VjujY 


+ log(3/d) 



□ 


We are now ready to bound the termwise entries of E. 

Lemma 9 (Concentration of Ejj). Let wi,... ,wl be i.i.d. random Gaussian vectors wi ~ A7(0,/), and let 
Ai = {Ml,..., Ml,} he constructed via projection of T along wi,... ,wi. Assume incoherence p for that the 
{ui): uJUj < p for i j. Furthermore, let L> Lq log(15/d)^. Then, with probability at least 1 — d, 

IP. , . ( PljGi , TT^iWr^ih {20 + VL^)logil5/6) \ 

J yi - {uJ UjY VL J’ 

where pij = \Tri\ui + \TTj\uj and itij = \'Ki\ + \'nj\. 
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Proof. Each term in Equation 13 concentrates sharply about its mean value. We bound each in turn. 
First, consider ||Ai|||/L = With probability at least 1 — d/S each, the following hold. 


1 II \ l|2 ^ _2||„, ||2 21og(5/l5)^ 

II Aj II 2 — 11%'II 2 J ■ 


Thus, using the fact that \\ui\W = 1, 




1 -I- 1 


iiA.iii ' iiA.iii; ^ i_2io^ 




Given our assumption on L, it follows that ^ Thus we can use the fact that < 1 + 2a; when 


a; < ^ to obtain the following bound: 


l|A.|li l|A 


■3 112 


< I 1 + 1 Ui + E2gd) V 


vz ) 


Next, we bound wjUirJ.jWi and 'UjZ’J.jWi. From Lemma 13, we have with probability at least 

1 — d/5 each. 


UjrJ^wi < rJjUj + ||r,j||2||wjII 2 ^3^^ 

< rTui + ||ry||2||M*||2 (sJ 

^ i=i \ 


lo g(5/d) ^ 


Note that by definition, ||ui ||2 = 1- 

Using Lemma 8, we have that with probability at least 1 — d/5, 


1 ^ -1 I' T A 2 " ( 1 + \/^^°s(15/d)y 

1 - pP 1 - {ul UjY \ \l L J 


1 1 

< 


Putting it all together, we get that with probability at least 1 — d. 


\Eij\ < 


1 


^ 1 - (uju,)^ t' ^ (if + ;|) (' + vz J 


41og(5/d) A 


\TTi\rJjUi + \TTj\rJjUj + {\TTi\ + |7rj|)||ry||2 3 


log(5/d) j 


Let us define ptj = \’ni\ui + \TZj\uj and tt^ = Itt^I + |7rj|: 


\Eij\ < 


1 


+ TTtjWrijh 


(uju,y t' + (if + ;|) (' + ^/i J 


41og(5/d) A 


Given that L > Lq log(15/d)^, we have that y ^ log(15/d) < 1 and thus 


1+ yy iog(i5/d)^ y +<2x2 


< 4 . 





























Volodymyr Kuleshov*, Arun Tejasvi Chaganty*, Percy Liang 


Finally, note that \Tri\rJ^Ui + \TTj\rJjUj < (|7ri| + \TTj\)\\rij\\ 2 , giving us, 


\Eij\ < 



7T]) l-iuju,r 


< 


,[ ^J\\2 

1 \ / Pjjnj 


7T,,\\r,,h (20 + v/I^)log(15/J) \ 

1-K^Wj)2 y/L )' 


□ 


Finally, we bound the error in estimating Uj. 

Theorem 8 (Non-orthogonal tensor factorization with random projections). Let wi,... ,wl be i.i.d. random 
Gaussian vectors, wi ^ JV{0,1), and let A4 = {Mi, ..., M^} he constructed via projection ofT along wi, ..., wl- 

Assume incoherenee fj, for both (ui) and (vi): uJuj < /i and vjVj < p, for i ^ j. Let Lq = . Let 

L > Lq log(15c?(fc — l)/(5)^. Then, with probability at least 1 — <5 and for e small enough. 


\\Uj -Uj\\2 < 


VlklliTTmax 

1 - ’"min 




where 0(6) = log(15(d(/c - l))/(5). 


Proof. From Lemma 15 we have that 


\ui - Ui \\2 < e 


\ 


d 

Y.P 


3(e), 


for j e [k]. 

Using Lemma 9, we have that with probability at least 1 — 5/{d(k — 1)), 


\L6ij\ < 


A ^ O 




< 


< 


min 

2 


1 


7r,,-||r,,||2 (20 + v/Io)log(15(d(fc-l))/^) \ 
l-(uluj)^ ' VL j 

T 2 (20 + log(15(d(A: - l))/(5) 


1 — fp 

1 

1 — /r2 


Pipij 2|7rnii„|||U 


{plr,,+2\T:^,^\\\V^\\lC(5)), 


Vl 


where we have defined C((5) = log(15(d(A: — l))/5) and are using the fact that ujuj < and iTij = 

l^i| \'^j\ — ^I'^maxl- 
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Applying a union bound on all the entries of Eij, we arrive at the following bound for all j. 


\uj -Uj\\2<e 

V 


- tA 

min 


2e 




- ^2 


4e 

4e 


+ 27rmax||AT||lC'(d) 


J2{p!,n,y + 2n^,^\\V^\\lC{6), 


A 


El 


^3 


- ttA ^ 

min V 


Y^iPljn^r + 27r„..JV^\\lC{6)Vdj . 


where we use the fact that (a + &)^ < 2(a^ + 6^). 

By Lemma 10 we also have, Yl^iiPijrijY < 4||7r||i7ri„ax||A^||t- Finally, note that TTmax < V^7r„J^: 

\\\V^\\l + 2i:^,^\\V^\\lC[5)Vd) 


4e 




"^111’^maxll 


< 


'7^111’^max 


VH 

1 - Tl-mii, 


||t/^||2(l + C(d)yd). 


□ 


D Proofs of auxiliary lemmas 

In this section, we prove some auxiliary results that appear as intermediate steps in the main lemmas above. 
Lemma 10 (Bounding p^r^). Let pij = iTiUi — iTjUj € K'^ and = R{vi,Vj,I) G where R is a tensor with 
unit operator norm and where (ui) G R'^ are unit vectors and (vi) G R'^ form the columns of the matrix V with 
bounded 2 norm. Then, 

d 

< 47rmax||7r||i||l/||^. 

Proof. Firstly, note that it is trivial to bound the sum as follows, 

d d 

<EllPdll2lkLll2 

<4(d-l)7rLxl|A||4, 

using the properties that Pij = iTiUi — T^jUj and that R has unit operator norm and thus ||pij ||2 E 27rinax and 
\\r.A2 = \\Riv^,v„I)h<\\V\\l 

However, we would like a tighter bound with a lower-order dependence on k. To do so, let us expand pij, 

d d 


d 




E' 

[TTiR{vi,Vj,Ui) - 

- ■KjR{vi,Vj,Uj)Y 






d 


d 

d 

E 

TrjR{vi,Vj,Ujf 


- E 2TTiTTjR{Vi, Vj,Ui)R{Vi, Vj,Uj) 
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Using the assumption that R has unit norm, the latter two terms can be bounded by ||7r||2||U|j2 and 2 tt ^ ||7r||i||U||| 
respectively. 

We now focus on the first term, R{'Vi,Vj,UjY. Note that R{vi,Vj,Uj) = R{I,Vj,Uj)^Vi = fjvi, where 

fj = R{I,Vj,Uj) and |]fj ||2 < ||U|j 2 by the operator norm condition on R. 


J2irJv.r = \\Vf,\\l 



Put together, we get that, 

d 

< ’^111^112 + Ikll2ll^^ll2 + 2^.|kl|l|l^ll2- 

Finally, < TTmaxlkHi and, by Holder’s inequality, ||7r|j^ < TTmaxlkHi, giving us, 

d 

Y^ipJjnjf < 47rmax||7r||i||U|||. 

□ 


E Concentration Inequalities 

In this section, we present several concentration results that are key to our results. The bounds presented 

in Laurent and Massart [34] play a key role and are reproduced below. 

Lemma 11 (x| tail inequality). Let q Xk distributed as a chi-squared variable with k degrees of freedom. 
Then, for any t > 0, 

P(g — k > 2^/kt + 2t) < 

P(A: — q > 2 Vkt) < e~*. 


Alternatively, we have that with probability at least 1 — i5, 


and similarly, with probability at least 1 — <5, 


(16) 


(17) 


Proof. See Laurent and Massart [34, Lemma 1]. □ 

Lemma 12 (Gaussian quadratic forms). Let x ~ A/'(0, 1) G be a random Gaussian vector. If A is symmetric. 
Ax is distributed as the sum of d independent x^ variables, where Xi are the eigenvalues of A. 

Proof. Let A = J2i=i b® tbe eigendecomposition of A. Then, x^Ax = J2i=i However, UiTxi 

is distributed as independent x\ random variables. Thus, x^ Ax = D 

Lemma 13 (Gaussian products). Let Xi ~ Af{0,I) G M'’* fori = 1,...,L be random Gaussian vectors. Let 
L > 41og(l/(5). Then, 
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1. where a G «s distributed as Haljlxi- Consequently, with probability at least 1 — <5, 

I s iioii^ (i+ 




<|ia||^ 1 + 3 


log(l/<^) j 


^ ||„||2 21og(l/(5)^ 

/ A ^i — 11^112 [ ^ I ■ 


2=1 


2. X]i=i Xi a,b G K'^ and a ^ b is sharply concentrated around a^b: with probability at least 1 — <5, 

< a^&+ |ia||2||6||2 

<a^&+|ia||2||6||2f 


log(l/<^) j 


Proof. The first part follows directly from Lemma 12 and the t^^il bound, Lemma 11. 

For the second part, let A = . Note that xjab^Xi = xjAxi. Then, by Lemma 12, xjAxi = AiXi + A 2 X 1 , 

where Ai and A 2 are the eigenvalues of A. Furthermore, because A = , one of Ai or A 2 is negative, and 

the other is positive. Without loss of generality, let Ai > 0 > A 2 . 

Applying the x^ tab bound. Lemma 11, we get that with probability at least 1 — <5, 


AiXi + Ai(l + 2 
|A2|x?>|A2|(i- 


log(2/<^) , ^ log(2/^) , 
L L ’ 

21og(2/J) 

Vl 


Applying a union bound, we get, 

L 


< A,(l + + a,(1 - 


< 


(A, + A.) + lA.I + +iM)) + |A.|+^ 


< (Ai + A 2 ) + (|Ai| + IA 2 I) 2 


log(2/^) 21og(2/J) \ 

L L j 


Observe that Ai + A 2 = tr(A) = aJb. Similarly, |Ai| + IA 2 I = ||A||* = 2(l||a||2||&||2)- Thus, we finally have that 
with probability at least 1 — i5, 


t a;7 ab^Xi < b + ||a||2||6|i2 
2=1 




21og(2/J) ^ 


□ 


F Perturbation bounds for joint diagonalization 

In this section, we present minor extensions to the perturbation bounds of Cardoso [28] and Afsari [24] so that 
they apply in the low-rank setting. 
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Notation Let Mi = UAiU^ + eRi for I = 1, 2,..., L be a set of d x d matrices to be jointly diagonalized. 


Az G 


p fcx fc 


is a diagonal matrix, Ri G 


is an arbitrary unit operator norm matrix and e is a scalar. In the 


orthogonal setting, U G is orthogonal, while in the non-orthogonal setting U G is an arbitrary matrix 

with unit operator norm. Let Xu = {Ai)i be the i-th factor weight of matrix M/. Finally, we say that a set of 

matrices {Mi, • • • , Ml}, Mi = J2i=i joint rank fc if {i\ |Aiz| >0} = k. 


Lemma 14 (Cardoso [28]). Let Mi = UAiU^ + eRi, I G [L], be matrices with common factors U G and 

diagonal A; G Let U G be a full-rank extension of U with columns ui,U 2 , ■ ■ ■ ,Ud and let U G 

be the orthogonal minimizer of the joint diagonalization objective F{-). Then, for all Uj, j G [k\, there exists a 
column Uj of U such that 


where i? G is 


\uj - Uj \\2 < e. 


\ i=i 


^ ^jl)^j Fl'^i 




when i ^ j and i < k or j < k. We define E^j = 0 when i = j and Xu = 0 when i > k. 


(18) 


(19) 


Proof. See Cardoso [28, Proposition 1]. Note that in the low rank setting, the entries of Eij (Cardoso [28, 
Equation 15]) where i,j > k are not defined, however, these terms only effect the last d — k columns of U. The 
bounds for vectors ui,...,Uk only depend on Eij where i G [d] and j G [A:], and these are derived in the low-rank 
setting in the same way as they are derived in the full-rank proof of Cardoso [28]. □ 


We now present the corresponding perturbation bounds in Afsari [24] to the low rank setting. 

Lemma 15 (Afsari [24]). Let Mi = UAiU^ -|- eRi, I G \L], be matrices with common factors U G and 

diagonal A; G . Let U G be a full-rank extension of U with columns ui, U 2 ,. ■. ,Ud and let V = U~^, 

with rows vi,V 2 , ■ ■ ■ ,Vd. Let V G be the minimizer of the joint diagonalization objective F{-) and let 

u = v-\ 

Then, for all uj, j G [k], there exists a column uj of U such that 


\uj - Ujh < e 


\ 


d 

i=l 


o(e), 


( 20 ) 


where the entries of E G satisfy the equation 


Eij 

-1 


Pij 


T ' 

Eji 

7ii(l - pIj) 

_—Pij 

- 


T 


when i j and either i < k or j < k. When i = j, Eij = 0. The matrix T has zero on-diagonal elements, and 
is defined as 


Tij — Vj RiVjXji, 
i 


for 1 < J ^ < d 


and the other parameters are 


lij — ||Ai|[ 2 |[Aj|| 2 , 


Vij — 


IIA.II2 

IIA1II2’ 


IIA1II2IIA.II2’ 




ik • 


We define Xu = 0 when i > k. 
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Proof. In Afsari [24, Theorem 3] it is shown that V = {I + eE)V + o(e), where Eij is defined for i,j G [d] (Afsari 
[24, Equation 36]). Then, 

U = U{I + eE)-^ + o{€) 

= U{I-eE) + o{e). 

Note that, once again, in the low rank setting, the entries of Eij when i,j > k are not characterized by Afsari’s 
results; however, these terms only effect the last d — k columns of U. □ 

Lemma 16. Let Mi = UAiU^ + eRi, I G [L], be matrices with common factors U G and diagonal 

A; G Let U G be a full-rank extension of U with eolumns ui,U 2 ,... ,Ud and let V = U~^, with rows 

vi,V 2 , ■ ■ ■ ,Vd. Let V G be the minimizer of the joint diagonalization objective E{-) and let U = V~^. 

Then, for all Uj, j G [k], there exists a column Uj of U such that 


\Uj - Uj\\2 < € 




where the entries of E G are bounded by 

, 1/1 1 

\Eij\ < 


1 - pf, V II A. 


ill 2 llAjlli 




E vj RiVjXj 


( 21 ) 


1^1 


y^vjRiVjXii 


when i ^ j and E^ = 0 when i = j and Xu = 0 when i > k. Here Xi = {Xu, Xi 2 ,Xil) G and pij 
is the modulus of uniqueness, a measure of how ill-conditioned the problem is. 


A, A, 


llA.IbllA, 


Proof. From Lemma 15, we have that 


Ei 

eI 


< 


where 


lij — ||Ai||2||Aj||2, 


7j*(l - pI,) 
IIA.II2 


piJ = 


I|A,||2’ ||A,I|2||A||2’ 

and the matrix T is defined to be zero on the diagonal and for i ^ j defined as 

L 


K A, 


Tij — E! RiVjX 


•jG 


tor 1 < j ^ i < d 


Taking || • || to be the /i-norm in the above expression, we have that 


Since 


and 


\E.j\ < \E.j\ + \E,.\ < + \T,^\) ■ 

Pij) 

Pij + rjji ^ IIAilli + llAjlll ^ 1 1 

7,. l|A.|lil|A,||i IIAIIi ^ ||A,||i 

L 

Tij = E! RlVjXjl, 


the claim follows. 


□ 

































